My course diary (this page) is located at https://hennake.github.io/IODS-project/.
My GitHub repository for this course is located at https://github.com/hennake/IODS-project.
Because Data Science is cool, and Open Data Science is even more so!
I’m quite familiar with R (5 years of experience), but I’d also like to learn how to use GitHub and RMarkdown.
How to get started with Git, GitHub and RMarkdown.
Working process was approved by the cat.
The dataset consists of data on the relationship between students’ learning approaches and their achievements in an introductory statistics course in Finland. More information about the original dataset is available at http://www.helsinki.fi/~kvehkala/JYTmooc/JYTOPKYS2-meta.txt.
After manipulation, the dataset only has the following variables:
* gender: M (Male), F (Female)
* age: age (in years) derived from the date of birth
* attitude: global attitude toward statistics
* deep: average of points from questions related to the deep learning approach
* stra: average of points from questions related to the strategic learning approach
* surf: average of points from questions related to the surface learning approach
* points: exam points
Students with zero exam points were removed from the dataset.
Data manipulation is described in detail here.
The data can be read into RStudio e.g. with the read.table command as follows:
students2014 <- read.table("./data/learning2014.txt", sep="\t", dec=".", header=TRUE)
Now the data is stored in the R object students2014.
How does the manipulated data look? A table of summary statistics is often a good starting point for an explorative analysis.
library(psych)
describe(students2014)
## vars n mean sd median trimmed mad min max range skew
## gender* 1 166 1.34 0.47 1.00 1.30 0.00 1.00 2.00 1.00 0.68
## age 2 166 25.51 7.77 22.00 23.99 2.97 17.00 55.00 38.00 1.89
## attitude 3 166 31.43 7.30 32.00 31.52 7.41 14.00 50.00 36.00 -0.08
## deep 4 166 3.68 0.55 3.67 3.70 0.62 1.58 4.92 3.33 -0.50
## stra 5 166 2.79 0.53 2.83 2.78 0.62 1.58 4.33 2.75 0.14
## surf 6 166 3.12 0.77 3.19 3.14 0.83 1.25 5.00 3.75 -0.11
## points 7 166 22.72 5.89 23.00 23.08 5.93 7.00 33.00 26.00 -0.40
## kurtosis se
## gender* -1.54 0.04
## age 3.24 0.60
## attitude -0.48 0.57
## deep 0.66 0.04
## stra -0.27 0.04
## surf -0.45 0.06
## points -0.26 0.46
A boxplot is a graphical tool for examining the distributions of individual variables. In this case, there is quite a lot of variation in the ranges of individual variables, and the picture is not as informative as one could wish.
boxplot(students2014)
We can draw a pairplot to check if there are associations between the variables.
library(GGally)
library(ggplot2)
p <- ggpairs(students2014, mapping = aes(col=gender, alpha=0.3), lower = list(combo = wrap("facethist", bins = 20)))
p
To check the gender distribution of students, we can use the table command. It seems that women are more interested in statistics than men! (Or perhaps there were simply more women for whom the course was mandatory…)
table(students2014$gender)
##
## F M
## 110 56
The pairplot suggests that the variable points might be slightly bimodal, but in a histogram there seems to be no severe deviation from normality.
hist(students2014$points, main="Distribution of exam points", xlab="points")
All in all, the explorative analysis suggests an association between a student’s attitude toward statistics and his/her exam points.
Let’s fit a regression model and check whether the suggested association between a student’s attitude and his/her exam points is statistically sound. Clearly, exam points is the response variable and attitude is an explanatory variable. Adding multiple explanatory variables to the model might reveal more associations, so let’s also include gender and surf as explanatory variables.
fit1 <- lm(points ~ gender + attitude + surf, data = students2014)
summary(fit1)
##
## Call:
## lm(formula = points ~ gender + attitude + surf, data = students2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.7179 -3.3285 0.5343 3.7412 10.9007
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.97982 2.40303 3.737 0.000258 ***
## genderM -0.22362 0.92477 -0.242 0.809231
## attitude 0.35100 0.05956 5.893 2.13e-08 ***
## surf 0.89107 0.54409 1.638 0.103419
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.304 on 162 degrees of freedom
## Multiple R-squared: 0.2051, Adjusted R-squared: 0.1904
## F-statistic: 13.93 on 3 and 162 DF, p-value: 3.982e-08
Attitude is the only statistically significant (p < 0.05) explanatory variable for exam points among the three tested. For the other two (p > 0.05) we don’t have sufficient evidence to reject the null hypothesis that their regression coefficients (column Estimate) are zero. We should note that the threshold p-value of 0.05 is only a coarse rule of thumb, and borderline cases should be examined more carefully.
Let’s refit the model without gender and surf.
fit2 <- lm(points ~ attitude, data = students2014)
summary(fit2)
##
## Call:
## lm(formula = points ~ attitude, data = students2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.63715 1.83035 6.358 1.95e-09 ***
## attitude 0.35255 0.05674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
The explanatory variable attitude is statistically significant (p < 0.001), so we can reject the null hypothesis that attitude has no effect on exam points. The intercept (constant) term of the model, i.e. the expected mean of exam points when attitude is zero, is also statistically significant. With only one explanatory variable left in the model, we don’t need to worry about potential collinearity between explanatory variables. The coefficient of attitude (0.35) means that when attitude increases by one unit, the student’s exam score is expected to increase by 0.35 points on average. The multiple R-squared value indicates that the model explains about 19 % of the variation in the response variable. The adjusted R-squared is essentially the same measure, but it penalizes for the fact that R2 automatically increases when extra explanatory variables are added to the model.
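As a quick sanity check, the fitted line can be evaluated by hand. A minimal sketch using the coefficients copied from the summary output above (rounded, so results are approximate):

```r
# Coefficients copied from the summary output above
b0 <- 11.63715  # intercept
b1 <- 0.35255   # slope for attitude

# Expected exam points for a given attitude score
predict_points <- function(attitude) b0 + b1 * attitude

predict_points(30)  # approx. 22.2 points
predict_points(40)  # 10 units more attitude -> approx. 3.5 points more
```

This is the same computation that predict(fit2) performs internally for each observation.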
We can now visualize the regression model on the observed data as a red line:
plot(students2014$attitude, students2014$points, xlim = c(0, max(students2014$attitude)), xlab="Attitude toward statistics", ylab="Exam points")
abline(fit2, col="red")
Note to self: Using the p-values of individual model coefficients is not the recommended way to select a model; it would be better to calculate e.g. an AIC value for each candidate model. In this case the conclusion is the same, as the model with the smaller AIC is the better fit.
paste0("AIC for model 1: ", AIC(fit1))
## [1] "AIC for model 1: 1030.9779677774"
paste0("AIC for model 2: ", AIC(fit2))
## [1] "AIC for model 2: 1029.9875233202"
Before declaring that a student’s attitude toward statistics is a good predictor of his/her exam performance, we have to make sure that the critical assumptions of the linear regression model are met. The assumptions of the model are:
1. The errors are normally distributed.
2. The errors are independent of each other.
3. The errors have a zero mean, i.e. the relationship is linear.
4. The errors have a constant variance (homoscedasticity).
Let’s produce a couple of diagnostic plots to check these assumptions.
par(mfrow=c(2,2))
plot(fit2, which=c(1,2,5))
The first plot (residuals vs. fitted values) shows no detectable pattern in the residuals, indicating that assumptions 3 and 4 hold. Assumption 1 also seems valid, as the residuals align quite neatly along the straight line in the QQ plot. No influential outliers show up in the last plot. Assumption 2 cannot be assessed from these diagnostic plots, but it is mainly a concern when temporal or spatial autocorrelation is possible (e.g. time series or spatial data). Thus, the assumptions of the linear regression model seem to hold for our model.
There is a statistical association between a student’s attitude toward statistics and his/her exam performance: the more positive the attitude, the better the expected performance. This association can be modeled with a linear regression model.
The joined dataset used in this analysis exercise combines two student alcohol consumption datasets. The following adjustments have been made in the data wrangling exercise:
* The two datasets were joined using the background variables common to both.
* alc_use is the average of workday and weekend alcohol consumption (Dalc and Walc).
* high_use is TRUE if alc_use is greater than 2, and FALSE otherwise.
More info: Original data source and R-code for data manipulation
First, let’s read the data into R and list the variable names:
alc <- read.table("./data/alc.csv", sep=";", dec=".", header=T)
colnames(alc)
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "nursery" "internet" "guardian" "traveltime"
## [16] "studytime" "failures" "schoolsup" "famsup" "paid"
## [21] "activities" "higher" "romantic" "famrel" "freetime"
## [26] "goout" "Dalc" "Walc" "health" "absences"
## [31] "G1" "G2" "G3" "alc_use" "high_use"
The purpose of the analysis is to study the relationships between high/low alcohol consumption and some of the other variables in the data. I chose the following variables for the analysis and set these hypotheses:
Response variable:
* high_use: TRUE if the student’s alcohol consumption is high
Explanatory variables:
* sex: males are more prone to high alcohol consumption than females
* Pstatus: students whose parents live apart are more prone to high alcohol consumption
* absences: a high number of school absences is associated with high alcohol consumption
* G3: a low final grade is associated with high alcohol consumption
Next, I limited the data used for the analysis to include only the five variables mentioned above.
library(dplyr)
dd <- select(alc, one_of(c("sex", "Pstatus", "absences", "G3", "high_use")))
str(dd)
## 'data.frame': 382 obs. of 5 variables:
## $ sex : Factor w/ 2 levels "F","M": 1 1 1 1 1 2 2 1 2 2 ...
## $ Pstatus : Factor w/ 2 levels "A","T": 1 2 2 2 2 2 2 1 1 2 ...
## $ absences: int 5 3 8 1 2 8 0 4 0 0 ...
## $ G3 : int 8 8 11 14 12 14 12 10 18 14 ...
## $ high_use: logi FALSE FALSE TRUE FALSE FALSE FALSE ...
Now we have some hypotheses produced with the Stetson-Harrison method. Will the data shoot them down straight away? Let’s produce some summaries and explorative plots of the variables and their associations to find out.
First, a basic summary table:
# Basic summary table
library(psych)
describe(dd, skew=F)
## vars n mean sd min max range se
## sex* 1 382 1.48 0.50 1 2 1 0.03
## Pstatus* 2 382 1.90 0.30 1 2 1 0.02
## absences 3 382 4.50 5.46 0 45 45 0.28
## G3 4 382 11.46 3.31 0 18 18 0.17
## high_use* 5 382 NaN NA Inf -Inf -Inf NA
Sex and parents’ cohabitation status are binary factors, and their associations with high alcohol use are easy to present as crosstabulations. Absences and final grade are integer variables with a wide range of values and thus less suitable for crosstabulation.
tab1 <- table(high_use=dd$high_use, sex=dd$sex)
tab2 <- table(high_use=dd$high_use, parents_status=dd$Pstatus)
tabp1 <- prop.table(tab1, 2)
tabp2 <- prop.table(tab2, 2)
list(addmargins(tab1), tabp1,
addmargins(tab2), tabp2)
## [[1]]
## sex
## high_use F M Sum
## FALSE 156 112 268
## TRUE 42 72 114
## Sum 198 184 382
##
## [[2]]
## sex
## high_use F M
## FALSE 0.7878788 0.6086957
## TRUE 0.2121212 0.3913043
##
## [[3]]
## parents_status
## high_use A T Sum
## FALSE 26 242 268
## TRUE 12 102 114
## Sum 38 344 382
##
## [[4]]
## parents_status
## high_use A T
## FALSE 0.6842105 0.7034884
## TRUE 0.3157895 0.2965116
We might visualize these associations with stacked proportional bar plots:
layout(matrix(c(1,1,2,2,3), nrow=1, ncol=5, byrow=T))
par(mai=c(0.6,0.3,0.6,0.3))
barplot(tabp1, main="Sex vs high use", col=c("#F8766D","#00BFC4"))
barplot(tabp2, main="Pstatus vs high use", col=c("#F8766D","#00BFC4"))
par(mai=c(0.7,0.1,0.7,0.1))
plot.new()
legend("topleft", legend=c("FALSE","TRUE"), fill=c("#F8766D","#00BFC4"), title="High use")
The associations of absences and final grade with alcohol use can be visualized more easily with boxplots:
library(ggplot2)
library(gridExtra)
g1 <- ggplot(dd, aes(x = high_use, y = G3, fill=high_use)) + ggtitle("Grade vs high use")
g2 <- ggplot(dd, aes(x = high_use, y = absences, fill=high_use)) + ggtitle("Absences vs high use")
bp3 <- g1 + geom_boxplot() + ylab("grade") + theme(legend.position="none", plot.title = element_text(hjust = 0.5))
bp4 <- g2 + geom_boxplot() + ylab("absences") + theme(legend.position="none", plot.title = element_text(hjust = 0.5))
grid.arrange(bp3, bp4, ncol=2, nrow=1)
Based on the explorative analysis, males seem to be more prone to heavy drinking than females. Heavy drinkers get lower grades on average than those who don’t drink much, and they are also slightly more inclined to have a high number of absences. However, there seems to be no association between parents’ cohabitation status and a student’s alcohol use. It seems that my hypotheses about the associations of sex, absences and final grade with alcohol consumption might get some support from the explorative analysis.
Let’s fit a logistic regression model to see if these findings hold.
# First model
fit1 <- glm(high_use ~ sex + Pstatus + absences + G3, family=binomial, data=dd)
summary(fit1)
##
## Call:
## glm(formula = high_use ~ sex + Pstatus + absences + G3, family = binomial,
## data = dd)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2443 -0.8411 -0.6244 1.0629 2.1686
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.98141 0.61643 -1.592 0.1114
## sexM 0.98330 0.24149 4.072 4.67e-05 ***
## PstatusT 0.01709 0.40418 0.042 0.9663
## absences 0.09273 0.02316 4.003 6.24e-05 ***
## G3 -0.07570 0.03584 -2.112 0.0347 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 425.51 on 377 degrees of freedom
## AIC: 435.51
##
## Number of Fisher Scoring iterations: 4
The model is of the form log(p/(1-p)) = a + b*sex + c*Pstatus + d*absences + e*G3, where p is the estimated probability of high alcohol use and a, b, c, d, e are the model coefficients. log(p/(1-p)) is the logit function of p. For a factor variable (sex and Pstatus), one level is chosen as the base level and absorbed into the intercept, and the coefficient reported for each remaining level denotes the difference from that base level. So for sex the base level F (omitted from the table) is part of the intercept -0.98, and the coefficient 0.98330 for sexM is the change in the log-odds for males relative to females; the intercept for a male student is thus -0.98141 + 0.98330 = 0.00189 (cf. explanation here).
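To make the logit concrete, here is a small sketch of the logit and its inverse (R’s qlogis and plogis), and of how a predicted probability could be computed by hand from the coefficients reported above. The student profile (a male with 4 absences and a final grade of 10) is made up purely for illustration:

```r
p <- 0.3
log(p / (1 - p))   # logit of p, same as qlogis(p)
plogis(qlogis(p))  # inverse logit recovers 0.3

# Linear predictor for a hypothetical male student with 4 absences and grade 10,
# using the coefficients from the summary output above
eta <- -0.98141 + 0.98330 + 0.09273 * 4 - 0.07570 * 10
plogis(eta)        # estimated probability of high alcohol use, approx. 0.41
```

This is the same back-transformation that predict(fit1, type="response") applies to every observation.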
The coefficients are easier to interpret if transformed to odds ratios (ORs) by exponentiation: OR = exp(b), where b is the original coefficient. The odds ratio of a coefficient denotes the multiplicative change in the odds p/(1-p) when the value of the explanatory variable increases by one unit. We can also use exponentiation and the confint function to calculate 95 % confidence intervals for the odds ratios.
data.frame(OR=round(exp(summary(fit1)$coef[,1]), 3),
CI=round(exp(confint(fit1)), 3),
"p-value"=round(summary(fit1)$coef[,4], 3))
## OR CI.2.5.. CI.97.5.. p.value
## (Intercept) 0.375 0.109 1.228 0.111
## sexM 2.673 1.675 4.325 0.000
## PstatusT 1.017 0.472 2.331 0.966
## absences 1.097 1.051 1.151 0.000
## G3 0.927 0.864 0.994 0.035
Interpretation of the OR values:
* sexM (OR 2.67): the odds of high alcohol use are about 2.7 times as high for males as for females.
* absences (OR 1.10): each additional absence multiplies the odds of high use by about 1.10, i.e. increases them by roughly 10 %.
* G3 (OR 0.93): each additional grade point multiplies the odds by about 0.93, i.e. decreases them by roughly 7 %.
* PstatusT (OR 1.02, with a confidence interval covering 1): parents’ cohabitation status has practically no effect on the odds.
We can check with a likelihood ratio test whether the factor variables sex and Pstatus as a whole are statistically significant explanatory variables in our fitted model.
anova(fit1, test="Chisq")
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: high_use
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 381 465.68
## sex 1 14.7320 380 450.95 0.0001239 ***
## Pstatus 1 0.1157 379 450.83 0.7337978
## absences 1 20.8326 378 430.00 5.012e-06 ***
## G3 1 4.4902 377 425.51 0.0340903 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
It seems that sex, absences and final grade (G3) are statistically significant explanatory variables for alcohol use. The directions of their effects are also as I hypothesized: the probability of high alcohol use increases with the number of absences, decreases with better grades, and is higher for males than for females. My hypothesis about parents’ cohabitation was however wrong, as cohabitation status is not a statistically significant explanatory variable in the model.
The variable Pstatus can be omitted from the model, as it is not statistically significant. Let’s refit the model:
fit2 <- glm(high_use ~ sex + absences + G3, family=binomial, data=dd)
data.frame(OR=round(exp(summary(fit2)$coef[,1]), 3),
CI=round(exp(confint(fit2)), 3),
"p-value"=round(summary(fit2)$coef[,4], 3))
## OR CI.2.5.. CI.97.5.. p.value
## (Intercept) 0.381 0.153 0.925 0.035
## sexM 2.674 1.676 4.326 0.000
## absences 1.097 1.051 1.150 0.000
## G3 0.927 0.864 0.994 0.033
anova(fit2, test="Chisq")
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: high_use
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 381 465.68
## sex 1 14.7320 380 450.95 0.0001239 ***
## absences 1 20.8782 379 430.07 4.894e-06 ***
## G3 1 4.5585 378 425.51 0.0327561 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Looks good, but how does the model perform as a binary classifier? We can use the model to predict the probability of high alcohol consumption for the individuals in the training data, and then dichotomize these probabilities into a binary prediction of the outcome. For this, we need a cutoff probability for a positive outcome (= high alcohol consumption). Choosing a sensible cutoff is an art in its own right, but let’s simply use the threshold of 0.5 (i.e. probabilities greater than 0.5 are interpreted as indicating high alcohol consumption). This is probably not an optimal choice, and it would be a better idea to define the cutoff value with e.g. a ROC curve.
# Predicting the training data
pred <- predict(fit2, type="response")
dd$prob <- pred
pred2 <- ifelse(pred > 0.5, TRUE, FALSE)
dd$pred <- pred2
Then we can produce a confusion matrix, which is a two-way crosstabulation of observed versus predicted values for the outcome variable, and visualize it.
# Confusion matrix
conf <- table(obs=dd$high_use, pred=pred2)
addmargins(conf)
## pred
## obs FALSE TRUE Sum
## FALSE 256 12 268
## TRUE 86 28 114
## Sum 342 40 382
prop.table(conf)
## pred
## obs FALSE TRUE
## FALSE 0.67015707 0.03141361
## TRUE 0.22513089 0.07329843
# Same as a plot
# initialize a plot of 'high_use' versus 'probability' in 'dd'
g <- ggplot(dd, aes(x = prob, y = high_use, col = pred2))
# define the geom as points and draw the plot
g + geom_point()
From the confusion matrix we can compute that the total proportion of inaccurately classified individuals (= the training error of the model) is (86 + 12) / 382 = 0.26. Is this better than the simple guessing strategy that nobody is a high user (as not drinking heavily is more common than drinking heavily)?
The training error of the model:
# define a loss function (mean prediction error)
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
# call loss_func to compute the average number of wrong predictions in the (training) data
loss_func(class = dd$high_use, prob = dd$pred)
## [1] 0.2565445
The training error of the simple guessing:
# Simple guessing strategy: nobody is a high user
dd$pred3 <- F
loss_func(class = dd$high_use, prob = dd$pred3)
## [1] 0.2984293
The training error of the simple guessing strategy is only slightly larger than the error of our model, so our model does not seem to be a very successful predictor of alcohol consumption.
Above, we assessed the model’s performance by its ability to correctly predict the same data on which it was trained. This is not a good idea, as a model generally fits its training data better than any unseen data. To avoid this overly optimistic estimate, we can perform cross-validation to get a more realistic estimate of the actual predictive power of the model. Let’s perform 10-fold cross-validation: we divide the data into 10 complementary parts, use one part as the test set and the remaining 9 parts as the training set, fit the model on the training set and validate it on the test set. We repeat this process so that eventually all of the data has been used for both training and testing, and then calculate the mean error rate over all repetitions.
library(boot)
cv <- cv.glm(data = dd, cost = loss_func, glmfit = fit2, K = 10)
cv$delta[1]
## [1] 0.2617801
We can see that the cross-validated error rate of our model is indeed higher than its training error (0.257). The test set performance of the model is about the same as that of the model fitted in the DataCamp exercise.
The dataset used in this exercise contains housing data from the suburbs of Boston. It is a built-in dataset in the R package MASS and can be loaded with the data(Boston) command. Let’s check the variables and the structure of the data:
# Load Boston data
library(dplyr)
library(MASS)
data(Boston)
# Structure and dimensions
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
There are 506 rows and 14 columns in the data. The dataset is described in more detail in the help file.
What does the data look like? Let’s produce summaries and a graphical overview of the variables:
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
All variables are numerical/integer except chas, which is a binary dummy variable indicating the location with respect to Charles River. The variables vary in scale.
# Pairplot
pairs(Boston, cex=0.1, pch=16)
Visualization of correlations between variables:
# Correlation matrix and visualization
library(tidyr)
library(corrplot)
cor_matrix<-cor(Boston)
corrplot(cor_matrix, method="circle", type="upper", cl.pos="b", tl.pos="d", tl.cex=0.6)
There seem to be a number of linear associations between the variables, as shown in the pairplot: e.g. between the number of rooms (rm) and the median value of owner-occupied homes (medv), and between the proportion of owner-occupied units built prior to 1940 (age) and the nitrogen oxides concentration (nox), to name but a few.
Let’s standardize the dataset for further analyses. This means that we subtract the column mean from each column and divide the difference by the column’s standard deviation. After standardization, all the (scaled) variables are centered around zero, and their unit is one standard deviation.
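As a minimal check of what scale() actually does, here is the same computation by hand on a small made-up vector:

```r
x <- c(1, 2, 3, 4)
z <- (x - mean(x)) / sd(x)          # standardization by hand
all.equal(as.numeric(scale(x)), z)  # TRUE: scale() does the same by default
mean(z)                             # 0 (up to rounding)
sd(z)                               # 1
```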
# Scaling
bs <- as.data.frame(scale(Boston))
summary(bs)
## crim zn indus
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668
## Median :-0.390280 Median :-0.48724 Median :-0.2109
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202
## chas nox rm age
## Min. :-0.2723 Min. :-1.4644 Min. :-3.8764 Min. :-2.3331
## 1st Qu.:-0.2723 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366
## Median :-0.2723 Median :-0.1441 Median :-0.1084 Median : 0.3171
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.2723 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059
## Max. : 3.6648 Max. : 2.7296 Max. : 3.5515 Max. : 1.1164
## dis rad tax ptratio
## Min. :-1.2658 Min. :-0.9819 Min. :-1.3127 Min. :-2.7047
## 1st Qu.:-0.8049 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876
## Median :-0.2790 Median :-0.5225 Median :-0.4642 Median : 0.2746
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6617 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058
## Max. : 3.9566 Max. : 1.6596 Max. : 1.7964 Max. : 1.6372
## black lstat medv
## Min. :-3.9033 Min. :-1.5296 Min. :-1.9063
## 1st Qu.: 0.2049 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median : 0.3808 Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4332 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 0.4406 Max. : 3.5453 Max. : 2.9865
Let’s transform the crime rate (crim) into a categorical variable. We want to cut the variable by quantiles to get the high, low and middle rates of crime into their own categories. This is achieved by using the quantiles as break points.
# Transform 'crim' to a categorical variable
bs$crim <- cut(bs$crim, breaks = quantile(bs$crim), include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))
# Look at the table of the new factor crim
table(bs$crim)
##
## low med_low med_high high
## 127 126 126 127
Then we need to divide the dataset into training and test sets for further analyses. Splitting the original data into training and test sets allows us to check how well our models generalize. We make the division here so that 80 % of the data belongs to the training set (and 20 % to the test set).
ind <- sample(nrow(bs), nrow(bs)*0.8)
train <- bs[ind,]
test <- bs[-ind,]
Linear discriminant analysis (LDA) is a classification method that can be used to find the linear combinations of the explanatory variables that best separate the target classes, to reduce the dimensionality of the data, and to predict the classes of new observations.
The target variable of LDA needs to be categorical.
Let’s try LDA on the newly created categorical crime rate variable (crim) to see if we can find combinations of variables that separate different crime rate classes from each other. This means that we use all the other variables in the dataset as predictor variables. The LDA is fitted on the train dataset.
# Linear discriminant analysis
set.seed(123)
lda.fit <- lda(formula=as.numeric(crim) ~ ., data = train)
# print the lda.fit object
lda.fit
## Call:
## lda(as.numeric(crim) ~ ., data = train)
##
## Prior probabilities of groups:
## 1 2 3 4
## 0.2400990 0.2599010 0.2524752 0.2475248
##
## Group means:
## zn indus chas nox rm age
## 1 0.89388984 -0.9242221 -0.15056308 -0.8546535 0.44171325 -0.8217232
## 2 -0.07153614 -0.3005562 0.02764047 -0.5980922 -0.11521022 -0.2753269
## 3 -0.38551208 0.2379095 0.19085920 0.4264219 0.02087842 0.4392126
## 4 -0.48724019 1.0171519 -0.03610305 1.0407953 -0.38374297 0.8175499
## dis rad tax ptratio black lstat
## 1 0.8513359 -0.6953464 -0.7575195 -0.3589851 0.38319236 -0.77756659
## 2 0.3787928 -0.5542040 -0.4974968 -0.1140732 0.31702161 -0.09998732
## 3 -0.4127286 -0.4177712 -0.2964489 -0.2869448 0.06041282 0.10046503
## 4 -0.8636662 1.6377820 1.5138081 0.7803736 -0.79019485 0.93028701
## medv
## 1 0.50589410
## 2 0.01590080
## 3 0.08149811
## 4 -0.71879216
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.11827100 0.64336902 -0.941913340
## indus 0.02236158 -0.32474235 0.347607538
## chas -0.06891498 -0.06991277 0.193287224
## nox 0.38797577 -0.68773674 -1.539815118
## rm -0.10784320 -0.04116682 -0.117098870
## age 0.20149052 -0.24909936 -0.010329165
## dis -0.11894277 -0.18349795 0.249202026
## rad 3.20222176 0.89308415 -0.002062512
## tax -0.02111567 0.01131337 0.593495703
## ptratio 0.12299172 0.08582481 -0.459611441
## black -0.12788607 0.01330834 0.081418545
## lstat 0.23443754 -0.21334714 0.438947623
## medv 0.19565088 -0.32336867 -0.185344867
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9494 0.0361 0.0146
We can see from the result that three linear discriminant functions were formed (the number of crime categories minus one, which is the maximum possible number). These discriminant functions are linear combinations of the original variables, and the coefficient of each variable in each function can be seen in the table. From the proportion of trace we can see that the first discriminant function LD1 captures about 94.9 % of the between-group variance, while LD2 and LD3 capture about 3.6 % and 1.5 % respectively.
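The proportion of trace printed above can be reproduced from the singular values stored in the fitted lda object (the svd component). A sketch using the built-in iris data as a stand-in, since the train set in this diary changes on every knit:

```r
library(MASS)

fit <- lda(Species ~ ., data = iris)
# proportion of trace = squared singular values, normalized to sum to 1
round(fit$svd^2 / sum(fit$svd^2), 4)  # LD1 approx. 0.99, LD2 approx. 0.01
```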
We can visualize the results of LDA with a biplot, on top of which the original variables and their effects are drawn as arrows. The first two LD functions are shown in the plot.
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# plot the lda results
classes <- as.numeric(train$crim)
plot(lda.fit, dimen = 2, col = classes)
lda.arrows(lda.fit, myscale = 2, col = "#666666")
The biplot reveals that the variable rad separates the high crime rate class from the other classes almost perfectly, as there is only one observation that mixes with the lower classes. The three lower crime rate classes don’t separate properly.
We can see how well the LDA model performs as a whole by using it to predict the classes of the test set. Then we can calculate the total prediction error by crosstabulating the results against the original data, similarly to what was done with the logistic regression model. (Note to self: there’s actually no need to remove the categorical crime rate variable from the test data, but I did as was instructed.)
# save and remove categorical crime data from test data
crime <- test$crim
test <- dplyr::select(test, -crim)
# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)
# cross tabulate the results
table(correct = as.numeric(crime), predicted = lda.pred$class)
## predicted
## correct 1 2 3 4
## 1 19 9 2 0
## 2 3 13 5 0
## 3 2 7 14 1
## 4 0 0 0 27
We can calculate from the confusion matrix that the total error rate is about 29 % in the test data. [The confusion matrix changes on every knitting because of the random train/test split, so the error rate given here probably doesn’t match the matrix shown above!] In other words, the model correctly classifies about 71 % of new cases, which is not a bad rate compared to the expected success rate of 25 % when randomly guessing one of four classes. The best classification performance was achieved in the high crime rate class.
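For reference, the total error rate is one minus the proportion of observations on the diagonal of the confusion matrix. A sketch using the matrix printed above (which, again, changes on every knit):

```r
# Confusion matrix as printed above: rows = correct class, columns = predicted
conf <- matrix(c(19,  9,  2,  0,
                  3, 13,  5,  0,
                  2,  7, 14,  1,
                  0,  0,  0, 27), nrow = 4, byrow = TRUE)

1 - sum(diag(conf)) / sum(conf)  # total error rate, approx. 0.28
```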
We used LDA to separate and predict known classes of individual observations. But what if we don’t know the classes in advance and want to detect similar observations and assign them to groups or clusters based on their similarity? We can use clustering methods, e.g. the K-means algorithm, to achieve this.
First, we need to calculate the (dis)similarity or distance of individual observations. There are many different distance measures, but let’s use the Euclidean distance, which is the “normal” or most common distance measure.
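As a quick reminder, the Euclidean distance between two points x and y is sqrt(sum((x - y)^2)), which is exactly what dist() computes by default. A tiny sketch with made-up points:

```r
x <- c(0, 3)
y <- c(4, 0)
sqrt(sum((x - y)^2))  # 5, the classic 3-4-5 triangle
dist(rbind(x, y))     # same value via dist()
```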
# load MASS and Boston
data('Boston')
bs2 <- as.data.frame(scale(Boston))
# euclidean distance matrix
dist_eu <- dist(bs2)
K-means is an unsupervised clustering method that iteratively updates the cluster memberships of individual observations so that eventually similar observations belong to the same clusters. We need to decide on the number of clusters before running the K-means algorithm. Let’s start with 5 clusters:
km <-kmeans(dist_eu, centers = 5)
But wait! How do we know the optimal number of clusters? One way to determine it is to look at how the total within-cluster sum of squares (WCSS) behaves when the number of clusters changes.
# determine the initial number of clusters
k_max <- 10
# calculate the total within sum of squares
set.seed(123)
twcss <- sapply(1:k_max, function(k){kmeans(dist_eu, k)$tot.withinss})
# visualize the results
plot(1:k_max, twcss, type='b')
The optimal number of clusters is where the total WCSS drops radically, so let’s rerun K-means with 2 clusters. We can visualize the results with a pairplot, where the clusters are separated by color.
# Optimal number of clusters is when the total of within cluster sum of squares drops drastically
# -> 2 clusters
km <-kmeans(dist_eu, centers = 2)
# plot the scaled Boston dataset with clusters
pairs(bs, col = km$cluster, cex=0.1, pch=16)
We can see from the pairplot that rad and tax are the two most informative variables for separating the two clusters. Also nox seems to contain useful information for clustering.
K-means clusters as targets of LDA
# k-means again
set.seed(123)
km2 <- kmeans(dist_eu, centers = 5)
# LDA with using the k-means clusters as target classes
bs2$cl <- km2$cluster
lda.fit2 <- lda(cl ~ ., data = bs2)
plot(lda.fit2, col=as.numeric(bs2$cl), dimen=2)
lda.arrows(lda.fit2, myscale = 3, col = "#666666")
Color is defined by the categorical crime rate:
model_predictors <- dplyr::select(train, -crim)
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
library(plotly)
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color=train$crim)
Color is defined by the clusters of the k-means:
train$cl <- bs2$cl[match(rownames(train), rownames(bs2))]
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color=as.factor(train$cl))